HW Overview

Please complete all core assignment tasks. Optional tasks do not carry any points but are highly recommended.

The goals of this HW include the following:

Please consult Canvas for the grading rubric (check Module 05 -> HW 05).

Linear Regression Basics

Batch Gradient Descent

Use the starter code below to calculate the first iteration of gradient descent to minimize the objective function $2x^2+ 4x$. We assume an initial guess of $x=2$ and a step size of 0.1.
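Since the starter code cells are not reproduced here, a minimal sketch of what the first update looks like (the function and values are the ones stated above; `grad` is a hypothetical helper name):

```python
# First iteration of gradient descent on f(x) = 2x^2 + 4x.
# The derivative is f'(x) = 4x + 4; start at x0 = 2 with step size 0.1.
def grad(x):
    return 4 * x + 4

x = 2.0
step = 0.1
x_new = x - step * grad(x)   # 2.0 - 0.1 * 12.0, i.e. approximately 0.8
print(x_new)
```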

Stochastic Gradient Descent

Assume you are learning a linear regression model (using the mean squared error objective loss function) via stochastic gradient descent with the following setting:

Linear Regression

Given a linear regression model with parameters θ where $θ_0$ corresponds to the bias term:

Boston House Prices

We have seen this dataset previously while working with KNN Regression. In this notebook, we're going to build a different regression model for predicting house prices in thousands of dollars given factors such as crime rate in neighborhood, number of schools, % lower status of the population, etc.

Import required libraries

Fix random seed for reproducibility

Reading data

The Boston dataset is extremely common in machine learning experiments, so it is bundled with sklearn.

Detailed description of dataset and features

Create pandas dataframe with objects in rows and features in columns

Exploratory data analysis (EDA)

All features are numerical, but note that some features are categorical (e.g., CHAS) while others are continuous.

Scatterplot and Histograms

We will start by creating a scatterplot matrix that will allow us to visualize the pair-wise relationships and correlations between the different features. It also gives a quick overview of how the data is distributed and whether or not it contains outliers.

DataFrame.copy(self, deep=True)

Make a copy of this object’s indices and data.

When deep=True (default), a new object will be created with a copy of the calling object’s data and indices. Modifications to the data or indices of the copy will not be reflected in the original object (see notes below).

We can spot a linear relationship between ‘RM’ and house prices ‘MEDV’. In addition, we can infer from the histogram that the ‘MEDV’ variable seems to be normally distributed but contains several outliers.

Let's also take a look into correlation matrix of features

Preprocessing

Splitting the data (train/test)

Let's split the data into train and test subsets so we can guard against rote learning (memorizing the training data).

There are lots of features. Let's visualize two of them across the train and test data.

Scaling

Normalize the data to the range $(0,1)$ to make our model insensitive to the scale of the features.

Note that we learn the normalization constants on the training set only. That's done because the assumption is that the test set is unreachable during training.

Transform test set with the same constants
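A minimal sketch of this fit-on-train, transform-both pattern (the tiny arrays stand in for the Boston features):

```python
# Min-max scaling: constants (per-feature min and range) are learned on
# the training set only, then reused unchanged on the test set.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test = np.array([[2.5, 15.0]])

scaler = MinMaxScaler()                    # scales each feature to [0, 1]
X_train_s = scaler.fit_transform(X_train)  # fit on train, then transform
X_test_s = scaler.transform(X_test)        # same constants applied to test
print(X_train_s.min(), X_train_s.max())    # 0.0 1.0
```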

Plot distributions of each input variable

Sklearn Linear Regression

Fitting

Here we use a very simple Linear Regression model. Scikit-learn uses the closed-form solution to the linear regression problem, so it gives very good results.

Fitting model on prepared data

Evaluation

Let's see which features are significant for the model. The largest coefficients have the greatest impact on the model.

Predicting both train and test sets to evaluate model

Mean absolute percentage error (MAPE)

There is no MAPE implementation in sklearn (because this metric is undefined when the true value is zero). Below you can find my own implementation.

The mean absolute percentage error (MAPE), also known as mean absolute percentage deviation (MAPD), is a measure of prediction accuracy of a forecasting method in statistics, for example in trend estimation. It usually expresses accuracy as a percentage, and is defined by the formula:

$${\displaystyle {\mbox{MAPE}}={\frac {100\%}{n}}\sum _{t=1}^{n}\left|{\frac {A_{t}-F_{t}}{A_{t}}}\right|,} $$

where $A_t$ is the actual value and $F_t$ is the forecast value.

The difference between $A_t$ and $F_t$ is divided by the actual value $A_t$. The absolute value of this ratio is summed for every forecasted point in time and divided by the number of fitted points $n$. Multiplying by 100% makes it a percentage error.
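The formula above translates directly into a few lines of numpy (a minimal sketch; the notebook's own implementation may differ in naming):

```python
# MAPE = (100 / n) * sum(|(A_t - F_t) / A_t|); undefined when any A_t == 0.
import numpy as np

def mape(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

# Errors of 10%, 10%, and 0% average to about 6.67%.
print(mape([100, 200, 400], [110, 180, 400]))
```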

Although the concept of MAPE sounds very simple and convincing, it has major drawbacks in practical application:

- It cannot be used if there are zero values (which sometimes happens, for example, in demand data) because there would be a division by zero.
- For forecasts which are too low the percentage error cannot exceed 100%, but for forecasts which are too high there is no upper limit to the percentage error.
- When MAPE is used to compare the accuracy of prediction methods it is biased in that it will systematically select a method whose forecasts are too low.
- This little-known but serious issue can be overcome by using an accuracy measure based on the ratio of the predicted to actual value (called the Accuracy Ratio); this approach has superior statistical properties and leads to predictions which can be interpreted in terms of the geometric mean.

For more details on MAPE please see here

Let's evaluate our model according to three different metrics:

Also we want to check quality on both train and test sets

Let's do it in a loop

Evaluated metrics:

It is also interesting to see how the predicted points relate to the real ones. If our model were perfect, all the points would lie on the black dotted line ($y=x$).

Cross-validation

A common method to evaluate a model is cross-validation. The idea is to divide the whole set of objects into $k$ sections (folds), then use one section as a test set and the other $k-1$ as a train set, repeating this for every section.

There is a special function for this in sklearn called $\text{KFold}$. It creates the sets of indices for cross-validation.
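A sketch of the loop described above (synthetic data stands in for the Boston features; `res` matches the variable name used later):

```python
# KFold yields index sets, not data: fit on k-1 folds, evaluate on the
# held-out fold, and average the errors.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(42)
X = rng.rand(100, 3)
y = X @ np.array([1.0, 2.0, 3.0]) + rng.randn(100) * 0.1

kf = KFold(n_splits=5, shuffle=True, random_state=42)
errors = []
for train_idx, test_idx in kf.split(X):          # index arrays per fold
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    errors.append(mean_squared_error(y[test_idx], model.predict(X[test_idx])))
res = np.mean(errors)                            # average CV error
print(res)
```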

Next step is to do everything that we've done before in a loop:

And store the average value of the errors ($\text{res}$ variable)

Here is the result of CV

Core Task: Boston Regression via pipelines

Using the Boston dataset from above (train, and blind test split).

Core Assignment: Cross-validation with k=10 and stdev.

Repeat the above experiment but change the code so that you can compute the standard deviation of the various metrics of interest (i.e., report MAE and stdMAE, RMSE and stdRMSE etc.) via cross-validation with k=10.

Core Task: Predicting Bike Sharing Demand (Kaggle competition)

The question provides a lot of background and sample code that you should review, run, and extend to get familiar with this problem and dataset. Then you should tackle the tasks posed at the end of this question.

Bike sharing systems are a means of renting bicycles where the process of obtaining membership, rental, and bike return is automated via a network of kiosk locations throughout a city. Using these systems, people are able to rent a bike from one location and return it to a different place on an as-needed basis. Currently, there are over 500 bike-sharing programs around the world [as of 2014].

The data generated by these systems makes them attractive for researchers because the duration of travel, departure location, arrival location, and time elapsed is explicitly recorded. Bike sharing systems therefore function as a sensor network, which can be used for studying mobility in a city. In this competition, participants are asked to combine historical usage patterns with weather data in order to forecast bike rental demand in the Capital Bikeshare program in Washington, D.C.

Overview on the project:

Load Data

EDA: understanding the bike demand data set

Understanding the Data Set

The dataset shows hourly rental data for two years (2011 and 2012). The training set covers the first 19 days of each month; the test set runs from the 20th day to the month's end. We are required to predict the total count of bikes rented during each hour covered by the test set.

In the training data set, bike demand is given separately for registered and casual users, and the sum of both is given as count.

The training data set has 12 variables (see below) and the test set has 9 (excluding registered, casual, and count).

Independent Variables

datetime:   date and hour in "mm/dd/yyyy hh:mm" format
season:     Four categories-> 1 = spring, 2 = summer, 3 = fall, 4 = winter
holiday:    whether the day is a holiday or not (1/0)
workingday: whether the day is neither a weekend nor holiday (1/0)
weather:    Four Categories of weather
            1-> Clear, Few clouds, Partly cloudy, Partly cloudy
            2-> Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
            3-> Light Snow and Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
            4-> Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog
temp:       hourly temperature in Celsius
atemp:      "feels like" temperature in Celsius
humidity:   relative humidity
windspeed:  wind speed
Dependent Variables

registered: number of registered users
casual:     number of non-registered users
count:      number of total rentals (registered + casual)

Notice the Training dataset has 3 target features:

registered: number of registered users
casual:     number of non-registered users
count:      number of total rentals (registered + casual)

Here we will focus on predicting count feature initially.

Check if there is any missing value.

Plot the feature distributions (including the targets: count, casual, registered)

Pandas and Dataframes

In computer programming, pandas is a software library written for the Python programming language for data manipulation and analysis. Data is stored in tabular format. In particular, it offers data structures and operations for manipulating numerical tables and time series.

The pandas library's features include:

For more info on Pandas dataframes see the following:


GROUP BY analysis: Sample use of a GROUP BY analysis: E.g., Season, or Holiday

E.g.,

Visualize categorical features like season and weather

weather:    Four Categories of weather
            1-> Clear, Few clouds, Partly cloudy, Partly cloudy
            2-> Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
            3-> Light Snow and Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
            4-> Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog

One hot encoding (OHE)

Features like season, holiday, weather, and workingday are in numerical form here. However, the numbers assigned to a feature like weather don't have any particular numeric ordering. As such, it is much better to one-hot encode these features. Here we can do this manually (via code) for the training and testing datasets. Later in this course we automate this process in a more principled manner (we have to deal with corner cases).
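One possible manual encoding uses `pd.get_dummies` (a sketch with a toy frame; the same transform must be applied to both train and test):

```python
# One-hot encode the weather column: each category becomes its own 0/1
# column, removing the spurious numeric ordering 1 < 2 < 3 < 4.
import pandas as pd

train = pd.DataFrame({"weather": [1, 2, 3, 1], "temp": [9.8, 9.0, 12.3, 15.6]})
weather_ohe = pd.get_dummies(train["weather"], prefix="weather")
train = pd.concat([train.drop(columns="weather"), weather_ohe], axis=1)
print(list(train.columns))  # ['temp', 'weather_1', 'weather_2', 'weather_3']
```

Note that `get_dummies` only creates columns for categories present in the frame, which is one of the corner cases mentioned above when train and test differ.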

Don't forget to OHE the test data!

Are there other features that should be OHE?

Hint: Season? Others?

Create new features (based on timestamp)

Separate input features and target feature for TRAIN ONLY.

Why separate input features and target feature for TRAIN only?

Answer: The test set is for creating a Kaggle submission.

Analysis of target variables

From the training data we saw that the sum of the registered and casual columns yields count. There is no need to keep these two columns as features; machine learning models tend to do better when the dataset is free of redundant columns.

Applying machine learning models (i.e., linear regression, DTs)

Split the training data into train and a blind test set

Split the training data into a train set and a blind test set using scikit-learn's train_test_split function. Remember that the test set that was provided is for Kaggle submission only and as such has no target values.

Features on larger scales can unduly influence the model, so we want all features on a similar scale. Scikit-learn's preprocessing module provides the StandardScaler class to scale our data.
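A minimal sketch of the StandardScaler pattern (toy arrays and variable names are illustrative):

```python
# Standardize features: fit mean/std on the train split only, then reuse
# the same constants on the blind test split.
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test = np.array([[2.0, 150.0]])

scaler = StandardScaler().fit(X_train)   # constants come from train only
X_train_s = scaler.transform(X_train)    # each column: zero mean, unit variance
X_test_s = scaler.transform(X_test)
print(X_train_s.mean(axis=0))            # ~[0. 0.]
```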

Keep a log book to track your experiments

Learn a Decision Tree Regressor

Let's learn a DecisionTreeRegressor model for comparison purposes.

Learn a Decision Tree ensemble, i.e., use RandomForestRegressor

Learn Decision Tree ensemble, RandomForestRegressor, model.

Generate a submission file on the Kaggle test data.

Don't forget to generate features

Similarly, use the same standard scaler for the test data

Clip all negative predictions at zero

This is a post-processing step to ensure that all negative prediction values (we cannot have negative bike rentals) are clipped to the value 0.
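A sketch of this clipping step with `np.clip` (the `preds` array is illustrative):

```python
# Clip negative predicted rental counts to zero and count how many
# predictions were negative before clipping.
import numpy as np

preds = np.array([12.3, -4.1, 0.0, 87.5, -0.2])
clipped = np.clip(preds, 0, None)        # lower bound 0, no upper bound
print(clipped)                           # [12.3  0.   0.  87.5  0. ]
n_negative = int((preds < 0).sum())
print(n_negative)                        # 2
```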

How many negative counts does your model predict? Plot a histogram using the code below.

Now upload this submission file to Kaggle

Good luck!

Sample of submission screenshot:


Core TASK: Report your results

This question (this entire section) so far has provided a lot of background and sample code that you should review, run, and extend to get familiar with this problem and dataset. Next, you should tackle the tasks posed here. Please feel free to adapt the above code to accomplish the following:

Overall, we want you to build a linear regression model using pipelines that builds on what was discussed earlier in this section and do a Kaggle submission for this Bike Demand prediction problem. In particular, we want you to address the following:

Reload original train data

OHE encoding on season feature (train)

Separate x, y data and the training data into a train and validation sets

SelectKBest feature selection (using the TopFeatureSelector method from Lab 04)

Build a full pipeline using Linear Regression

Fit train data and generate predictions on the validation data

That's a little more realistic, but still doesn't appear to be distributed like our ground truth.

Generate statistics on the predictions.

Prepare a Kaggle submission using our pipeline model

Re-load the Kaggle test data

Perform the same OHE steps as we did on training data

Core task: bike demand extension 1 (2 separate LR models)

Build separate linear regression models for the other target variables (instead of the count target variable) and report your results:

registered: number of registered users
casual:     number of non-registered users

Overall, we want you to build a linear regression model using pipelines that builds on what you learned as a solution to the previous section, and do a Kaggle submission for this Bike Demand prediction problem. In particular, we want you to address (using what you have done previously) the following:

Load data

Reload training data

OHE for weather and season; generate time and date features

Separate x, y data and perform train_test_split on training data only

Select K best features registered and casual targets

Fit train data and generate predictions on the "test" split of training data

Because we used a pipeline, we can easily generate predictions on two targets using the same model

First using Kbest features for registered target

The last two features have p-values greater than 0.01, so they will be excluded

Then using Kbest features for casual target

Report results

MAPE is undefined for registered and casual models because each includes zeros among the actual (true) target values.

Core task: bike demand extension 2 (target = log(demand+1))

The distribution of the raw count feature is very skewed, with most values occurring on the left of the count domain. To make this distribution less skewed (and more normally distributed), it is common to take the log of the value + 1 (to avoid log(0)).
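A sketch of this transform and its inverse; numpy's `log1p`/`expm1` implement log(x + 1) and its inverse directly:

```python
# log1p maps count 0 to 0 (no log(0) issue); expm1 inverts the transform
# so predictions can be mapped back to the original count scale.
import numpy as np

counts = np.array([0, 1, 3, 40, 500])
y_log = np.log1p(counts)              # log(count + 1)
restored = np.expm1(y_log)            # back to counts
print(np.allclose(restored, counts))  # True
```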

Overall, we want you to build a linear regression model (single model will suffice based on the count target) using pipelines that builds on what you learned as a solution to the previous section, and do a Kaggle submission for this Bike Demand prediction problem. In particular, we want you to address (using what you have done previously) the following:

Load data

Reload training data

OHE for weather and season; generate time and date features

Prepare the train data

Separate x, y from the train data and log transform y

Note that zeros no longer dominate other target values in transformed data

Sequential Backward Selection

Beyond 5 features there is not much additional benefit in terms of explained variance.

Modify TopFeatureSelector to select the k-th subset from our SBS object

Build a regression model (with a Pipeline)

Generate predictions

Present results

Prepare a Kaggle submission

Load and preprocess Kaggle test data

Generate predictions using pipeline and best features (as determined by SBS) only

Generate predictions

Transform back to original scale and integer values (counts)

Present Kaggle predictions in the format required for submission

Core task: bike demand extension 3 (LR + Regularization)

L1 and L2 regularization for Linear Regression

Regularization is a way of penalizing the model for excessive complexity, and this helps reduce the risk of overfitting.

There are many ways of doing regularization but these two are the major ones:

Use sklearn's Ridge and Lasso linear regression classes to perform feature selection. Use grid search to determine the best value for $\alpha$, the mixing coefficient associated with the penalty term in regularized linear regression.
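A sketch of the grid search over $\alpha$ for both model families (the synthetic data and the alpha grid are illustrative, not the assignment's required values):

```python
# GridSearchCV over alpha for Ridge (L2) and Lasso (L1) regression.
import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
X = rng.rand(200, 5)
y = X @ np.array([3.0, 0.0, 0.0, 1.0, 0.0]) + rng.randn(200) * 0.1

alphas = {"alpha": [0.001, 0.01, 0.1, 1.0, 10.0]}
for Model in (Ridge, Lasso):
    grid = GridSearchCV(Model(), alphas, cv=5)   # 5-fold CV per alpha
    grid.fit(X, y)
    print(Model.__name__, grid.best_params_["alpha"])
```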

Reload and restore the training data

Build GridSearch pipeline to compare alpha combinations using Ridge and Lasso

Generate predictions using best combination and report results in table

Prepare a Kaggle submission

Load and preprocess Kaggle test data

Generate predictions using best regularization pipeline

Why didn't we use feature selection for this problem?

pipeline.named_steps

Get access to the internals of the pipeline using the following:

pipe.named_steps.estimator.coef_

Core Task: Diabetes Progression Regression (with pipeline-based derived features and feature selection)

This section provides all the data and code. Please run all cells in this section and respond to the questions included here. Feel free to modify the code once you have completed the tasks, and experiment with new settings.

Introducing the problem dataset

Predict diabetes progression one year after baseline using patient information and tests such as:

Load the dataset

Exploratory data analysis

Perform EDA through visualizing the data and also do other statistical analysis.

Note that we shift the X data to make it all positive. This doesn't affect the results, but some of the subsequent algorithms only work with positive data.

Let's also take a look into correlation matrix of features

Compute the correlation matrix between all inputs and the output

Visualize the data pairwise using the seaborn's pairplot()

Split data into Train and Test

Split data into 70% training and 30% test data. GridSearch uses cross-validation internally, so we do not need a separate validation set.

Feature engineering

At this point we could do some feature engineering. For example, we could create a new area feature based upon taking the product of the length and width features if a problem had such features (e.g., see the iris flower classification problem). We would view this feature as a derived feature based upon the interaction of two base input features.

Next we look at how to generate interaction features and polynomial features using SKLearn's PolynomialFeatures().

Polynomial and Interaction Features

Here we explore datasets with polynomial features and interaction features.

First, we can generate such a dataset using sklearn; the code below demonstrates how. (You can also use this link to check the user guide.)

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x = np.arange(4).reshape(2, 2)
poly = PolynomialFeatures(2)  # degree-2 polynomial and interaction terms
print(x)
poly.fit_transform(x)

### outputs the following:
[[0 1]
 [2 3]]
array([[1., 0., 1., 0., 0., 1.],
       [1., 2., 3., 4., 6., 9.]])

Here using the original dataset of the form $(x_1,x_2)$ = [[0,1], [2,3]], in combination with PolynomialFeatures(2) yields a new dataset $(1,x_1,x_2,x_1^2,x_1x_2,x_2^2)$ = (1,0,1,0,0,1), (1,2,3,4,6,9)

Train a ridge linear regression model using derived features

The General Pipeline Interface

The Pipeline class is not restricted to preprocessing and classification, but can in fact join any number of estimators together. For example, you could build a pipeline containing feature extraction, feature selection, scaling, and classification, for a total of four steps (at least!). Similarly, the last step could be regression or clustering instead of classification.

The only requirement for estimators in a pipeline is that all but the last step need to have a transform method, so they can produce a new representation of the data that can be used in the next step. Internally, during the call to Pipeline.fit(), the pipeline calls fit and then transform on each step in turn(or just fit_transform()), with the input given by the output of the transform method of the previous step. For the last step in the pipeline, just fit() is called.
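A minimal sketch of such a multi-step pipeline (step names, data, and the chosen estimators here are illustrative):

```python
# Pipeline: all steps but the last must have transform(); the last only
# needs fit()/predict(). fit() chains fit+transform through the steps.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X = rng.rand(100, 6)
y = X[:, 0] * 2.0 + X[:, 3] + rng.randn(100) * 0.05

pipe = Pipeline([
    ("scaler", StandardScaler()),                # transform step
    ("select", SelectKBest(f_regression, k=2)),  # transform step
    ("ridge", Ridge(alpha=1.0)),                 # final estimator
])
pipe.fit(X, y)
print(pipe.score(X, y))   # R^2 on the training data
```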

Heatmap plotting helper function

Grid-searching: preprocessing, polyFeatures, and modeling

Question: polynomial features

Do the polynomial features help with the test error?

Question: mean_absolute_error

What is the mean absolute error on the test dataset for this pipeline? Report to two decimal places.

Simple pipeline with original features only

Just crosschecking the pipeline with original features only

Pipeline-based derived features and feature selection

Question: number of features

Use SelectPercentile for feature selection. What is the best percentile using this pipeline?

Question: mean_absolute_error

What is the mean absolute error on the test dataset for this pipeline? Report to two decimal places.

Core Task: Linear Regression with custom dataset

Load Data

Consider the provided dataset "1.csv". The data contains 5 columns: $x_1, x_2, x_3, x_4$, and $y$. We know that the data was generated by the function $y=\sum_{i=1}^4 w_i x_i+b$, where $w_1, \ldots, w_4,b$ are constants. However, we do not know the coefficients, and the data is noisy. Let us implement linear regression to find $w_1, \ldots, w_4,b$.

Data Augmentation

Then we implement data augmentation as shown in video 4.5 and solve for the closed-form solution to linear regression

Linear Regression

Implement Closed-Form Solution

Question: what is the value of bias term? Please report up to 3 decimal places.

Calculate MSE

Question: what is MSE for this given dataset? Please report up to 3 decimal places.

Using SKlearn

Let's try the linear regression tool from sklearn. Is the solution similar to yours?

Here we can see that the original function could be $y=3x_1+x_4+2$

Polynomial and Interaction Features

Part 1 of this question showed a linear relation between the inputs and the outputs. However, this is usually not the case in real-life applications. In this part, we will explore datasets with polynomial features and interaction features.

First, we can generate such a dataset using sklearn; the code below demonstrates how. (You can also use this link to check the user guide.)

Here the original dataset is $(x_1,x_2)$ = (0,1), (2,3), and we generated a new dataset $(1,x_1,x_2,x_1^2,x_1x_2,x_2^2)$ = (1,0,1,0,0,1), (1,2,3,4,6,9)

Load the Second Dataset

Now consider a dataset "2.csv" similar to part1, generated by the function $y=\sum_{i=1}^3 w_i x_i+\sum_{i=1}^3 v_i x_i^2+\sum_{i,j=1(i\ne j)}^3 u_{ij} x_i x_j + b$.

Polynomial Regression

Is this new function solvable by linear regression? The answer is yes. The only thing we have to do is to generate polynomial and interaction features.

Now that we know the coefficients of the original function, let's see if we can recover them from all the derived features (interactions and polynomials). Run the following code and respond to the questions after it.

Understanding this model

Ok, so we just used grid search to search for the best model/parameter combination for this model. Let's take a look and understand what it means.

First, let's see what is inside this pipeline.

Order of Coefficients

To figure out which feature was dropped, we print out the order of the coefficients, then look at the p-values of the feature-selection step.

Does the pipeline select the raw features only? Yes or no?

Optional Task: Linear Regression via GD from scratch

The objective function of linear regression is the following: $$ f(\mathbf{w}, b) = \frac{1}{n}\sum_{i=1}^{n}\left[ (\mathbf{w}\cdot\mathbf{x}_i + b) - y_i\right]^2,\\ n = \left|X_{\text{train}}\right| $$ So we want to minimize the squared difference between predictions and true answers; this is called the Mean Squared Error (MSE). Gradient descent is a way of optimizing this function by tuning the weights $\mathbf{w}$ and bias $b$.

To be able to treat the weights $\mathbf{w}$ and bias $b$ homogeneously, we're going to augment the data with a "shell" feature (all $1$'s). Then we can add one more parameter to the weight vector and treat it as the bias. $$ \mathbf{x}' := \begin{bmatrix} \mathbf{x}\\ 1 \end{bmatrix},\quad \boldsymbol{\theta} := \begin{bmatrix} \mathbf{w}\\ b \end{bmatrix} \\ f(\boldsymbol{\theta}) = \frac{1}{n}\sum_{i=1}^{n}\left[ \boldsymbol{\theta}\cdot\mathbf{x}'_i - y_i\right]^2 $$ This makes the optimization process much easier to carry out.

To simplify it further and do it in "tensor" way let's rewrite it in matrix form. Let's introduce data matrix (the same as dataframe we used everywhere above)

$$ \text{X}' = \begin{bmatrix} \mathbf{x'}_1^{\text{T}}\\ \vdots\\ \mathbf{x'}_n^{\text{T}} \end{bmatrix},\quad \mathbf{y} = \begin{bmatrix} y_1\\ \vdots\\ y_n \end{bmatrix} $$

Matrix $\text{X}'$ contains objects in its rows and features in its columns. Vector $\mathbf{y}$ is the vector of answers. Then the objective can be rewritten as follows:

$$ f(\boldsymbol{\theta}) = \frac{1}{n}\|\text{X}'\cdot \boldsymbol{\theta} - \mathbf{y}\|_2^2 $$

Then the gradient can be easily calculated in vectorized form:

$$ \nabla_{\boldsymbol{\theta}} f(\boldsymbol{\theta}) = \frac{2}{n}\,\text{X}'^{\text{T}}\left(\text{X}'\cdot \boldsymbol{\theta} - \mathbf{y}\right) $$

Exactly these computations are implemented below in the BasicLinearRegressionHomegrown class
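A minimal sketch of such a class, assuming a simple fit/predict interface (the notebook's actual BasicLinearRegressionHomegrown may differ; the learning rate and iteration count here are illustrative):

```python
# Full-batch gradient descent on the MSE objective, with X augmented by
# a column of 1's so the bias is the last entry of theta.
import numpy as np

class BasicLinearRegressionHomegrown:
    def __init__(self, lr=0.1, n_iter=1000):
        self.lr, self.n_iter = lr, n_iter

    def _augment(self, X):
        return np.hstack([X, np.ones((X.shape[0], 1))])

    def fit(self, X, y):
        Xa = self._augment(X)
        n = Xa.shape[0]
        self.theta = np.zeros(Xa.shape[1])
        for _ in range(self.n_iter):
            # Vectorized gradient: (2/n) X'^T (X' theta - y)
            grad = (2.0 / n) * Xa.T @ (Xa @ self.theta - y)
            self.theta -= self.lr * grad
        return self

    def predict(self, X):
        return self._augment(X) @ self.theta

rng = np.random.RandomState(0)
X = rng.rand(200, 2)
y = X @ np.array([2.0, -1.0]) + 0.5   # true weights [2, -1], bias 0.5
model = BasicLinearRegressionHomegrown().fit(X, y)
print(model.theta)   # close to [2.0, -1.0, 0.5]
```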

Data

Split into train and test sets (with the same $\text{random_state}$, so we can compare results)

Scaling

Basic version of homegrown Linear Regression

Create model

Fitting

Evaluation

Optional Assignment: Random Search

Random search algorithm consists of the following steps:

  1. Sample a set of weights from some distribution. Here we're going to use the uniform distribution. $$ \boldsymbol{\Theta} = \{\boldsymbol{\theta}_1, \boldsymbol{\theta}_2 \ldots \boldsymbol{\theta}_{N}\} $$
  2. Now we have a set of weights $\boldsymbol{\Theta}$ for Linear Regression. The idea is to choose the best one according to the objective. $$ \boldsymbol{\theta^*} = \underset{\boldsymbol{\Theta}}{\text{argmin}} \sum_{i=1}^{n}\left[\boldsymbol{\theta} \cdot \mathbf{x_i} - y_i\right]^2 $$

Create model

Fitting

Evaluation

Optional Assignment: Numerical Calculation

The formula for analytical gradient (from calculus):

$$ \nabla f(\mathbf{x}) = \begin{bmatrix} \frac{\partial f}{\partial x_1}\\ \vdots\\ \frac{\partial f}{\partial x_m} \end{bmatrix}, \text{ where } m \text{ is the space dimension}\\ \frac{\partial f}{\partial x_1} = \lim_{\alpha \rightarrow 0} \frac{f(x_1 + \alpha, x_2 \ldots x_m) - f(x_1, x_2 \ldots x_m)}{\alpha} $$

For sufficiently small $\alpha$, one can approximate the partial derivative by simply dropping the limit operator

$$ \frac{\partial f}{\partial x_1} \approx \frac{f(x_1 + \alpha, x_2 \ldots x_m) - f(x_1, x_2 \ldots x_m)}{\alpha} = \left( \frac{\partial f}{\partial x_1} \right)_{\text{num}}\\ $$

Then the final approximation of the gradient is:

$$ \nabla f(\mathbf{x}) \approx \nabla_{\text{num}\,\,} f(\mathbf{x}) = \begin{bmatrix} \left( \frac{\partial f}{\partial x_1} \right)_{\text{num}}\\ \vdots\\ \left( \frac{\partial f}{\partial x_m} \right)_{\text{num}} \end{bmatrix} $$

The common way of measuring the difference between vectors is the following: $$ \text{er} = \frac{\|\nabla f(\mathbf{x}) - \nabla_{\text{num}\,\,}f(\mathbf{x})\|_2^2}{\|\nabla f(\mathbf{x})\|_2^2} = \frac{\sum_{j=1}^{m}\left(\nabla^j f(\mathbf{x}) - \nabla^j_{\text{num}\,\,}f(\mathbf{x})\right)^2}{\sum_{j=1}^{m}\left(\nabla^j f(\mathbf{x})\right)^2} $$
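A sketch of this numerical gradient check applied to the MSE objective (function and variable names are illustrative; $\alpha$ is the finite-difference step):

```python
# Compare the analytic gradient of the MSE with a forward-difference
# approximation, using the relative error measure defined above.
import numpy as np

def objective(theta, X, y):
    return np.mean((X @ theta - y) ** 2)

def analytic_grad(theta, X, y):
    return (2.0 / X.shape[0]) * X.T @ (X @ theta - y)

def numeric_grad(theta, X, y, alpha=1e-6):
    g = np.zeros_like(theta)
    for j in range(theta.size):
        e = np.zeros_like(theta)
        e[j] = alpha                 # perturb one coordinate at a time
        g[j] = (objective(theta + e, X, y) - objective(theta, X, y)) / alpha
    return g

rng = np.random.RandomState(0)
X, y = rng.rand(50, 3), rng.rand(50)
theta = rng.rand(3)
ga = analytic_grad(theta, X, y)
gn = numeric_grad(theta, X, y)
err = np.sum((ga - gn) ** 2) / np.sum(ga ** 2)   # relative error as above
print(err)
```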

Create model

Fitting

Plotting error curves

Optional Assignment: Stochastic Gradient Descent

In full GD we take a descent step only after calculating the gradient over the whole dataset. In this case the gradient is exact and gives the best possible direction, but it can take quite a lot of time if we have huge amounts of data.

In practice we can get faster convergence if we calculate the gradient not over the whole dataset but over a small batch (of size $B$) of it.

$$ \nabla f(\boldsymbol{\theta}) \approx \nabla_{\text{batch}\,\,} f(\boldsymbol{\theta}) = \frac{2}{B}\sum_{i=1}^{B}\left(\mathbf{x}'_{a_i}\cdot \boldsymbol{\theta} - y_{a_i}\right)\cdot \mathbf{x}'_{a_i} $$

where $a_i$ is an array of indices of the objects in this batch. The common approach, which you should use, is to shuffle the samples randomly and then iterate over them in batches.

So with this batch approach we get an approximation of the true gradient at the point $\boldsymbol{\theta}$. This approximation is very cheap and fast to compute (usually $B$ is not too big $-$ from 32 to 256). After obtaining this gradient we take a descent step in this approximate direction and proceed to the next batch.
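A sketch of one run of this mini-batch scheme on noise-free synthetic data (the learning rate, batch size, and epoch count are illustrative):

```python
# Mini-batch SGD for linear regression: shuffle indices each epoch, then
# take a gradient step per batch of size B.
import numpy as np

rng = np.random.RandomState(0)
X = np.hstack([rng.rand(200, 2), np.ones((200, 1))])   # augmented X'
y = X @ np.array([2.0, -1.0, 0.5])                     # exact targets
theta = np.zeros(3)
lr, B = 0.1, 32

for epoch in range(200):
    idx = rng.permutation(len(X))                      # shuffle each epoch
    for start in range(0, len(X), B):
        a = idx[start:start + B]                       # batch indices a_i
        grad = (2.0 / len(a)) * X[a].T @ (X[a] @ theta - y[a])
        theta -= lr * grad
print(theta)   # close to [2.0, -1.0, 0.5]
```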

Create model

Fitting

Evaluation

L1 and L2 regularization from scratch

Incorporate L1 and L2 regularization for the BasicLinearRegressionHomegrown class developed above. Start with L2 regularization.

Optional Assignment: Adaptive step size [no bonus points for this question]

Line Search of the step size

Instead of taking a gradient step with a fixed step size ($\alpha=0.0005$), treat the step size as a variable after choosing the step direction (the gradient) and try to optimize it. In other words, solve the following 1D optimization problem analytically:

$$ f\left(\boldsymbol{\theta}_{t} - \alpha \cdot \nabla f(\boldsymbol{\theta}_{t})\right) \rightarrow \min_{\alpha} $$

Modify your GD model to use adaptive step size